Comparison of Inversion Algorithms for Wilson Fermions on the CM5* 

Rajan Gupta, a Tanmoy Bhattacharya, and Gregory Kilcup a 

a T-8 Group, MS B285, Los Alamos National Laboratory, Los Alamos, New Mexico 87545 U. S. A. 
a Physics Department, The Ohio State University, Columbus, OH 43210 U. S. A. 

This talk presents results of a comparitive study of iterative algorithms like minimal residue (AIR) and conjugate 
gradient (CG, BiCG^, and BiCGstab) used for inverting the Dirac matrix M. The tests were done on the 
Connection Machine CM-5 using 32 3 x 64 lattices. The fermion action used is of Wilson type, both with and 
without the clover term. The overall conclusion is that preconditioned over-relaxed MR is the simplest, uses 
the least memory, and is comparable in performance to BiCGstab. We find these two algorithms to be equally 
robust, i.e. insensitive to the starting solution and to round-off errors. 



1. INTRODUCTION 

The solution of the Dirac equation for an ar- 
bitrary source constitutes the dominant fraction 
of computer time used in the numerical simu- 
lation of both quenched and unquenched QCD. 
In quenched simulations, the Dirac propagator 
is used in the construction of correlation func- 
tions of observables involving quark fields. De- 
pending on how many observables are measured 
using a given set of quark propagators, the ma- 
trix inversion is 40 — 70% of the total CPU time. 
In full QCD simulations this number jumps to 
> 90% as matrix inversion is also needed in the 
update. Large scale simulations (quenched and 
unquenched) currently consume the equivalent of 
many giga-flop years on a variety of supercomput- 
ers. Thus, fast solvers are the key to progress, and 
we consider algorithm improvements by > 20% 
significant. In this talk we catalogue and com- 
pare the efficiency of commonly used algorithms 
(Minimal Residue (MR), conjugate gradient (CG, 
BiCGjs, and BiCGstab)) for Wilson and Clover 
class of actions. Further details of the lattice gen- 
eration, the calculation of quark propagators, and 
the analysis of the meson and baryon spectrum 
are given in Ref . [Q . 

*Talk presented by Rajan Gupta at the workshop "Ac- 
celerating Fermion Algorithms", Jiilich, Feb 1996. The 
calculations reported here have been done on the CM5 at 
LANL as part of the DOE HPCC Grand Challenge pro- 
gram, and at NCSA under a Metacenter allocation. 



2. TECHNICAL DETAILS 

The Dirac matrix discretised a la Wilson on a 
4-dimensional hypercubic grid can be written as 

M = 1 + nD(r) (1) 

where the matrix D(r) connects a given site to its 
8 nearest neighbours, r is the Wilson parameter 
introduced to remove the doubling problem, and 
k is a parameter in terms of which the quark mass 
is defined as 

m ^i{l-£) (2) 

where a is the lattice spacing and k c denotes the 
chiral limit at which the matrix becomes singular. 
The quark propagator x is the solution to the set 
of linear equations 

(1 + KD(r)) X = <t> (3) 

where (f> is an arbitrary source vector. For a delta 
function source it represents the propagation of a 
quark from that point to all other points on the 
lattice. 

2.1. Properties of the Dirac matrix 

The Dirac matrix satisfies the identity 
75M75 = Aft which is retained by both the 
Wilson and Staggered discretized versions. As 
a result the eigenvalues of M come in complex- 
conjugate pairs. For the staggered theory, they lie 
along a line in the imaginary direction at m q . For 
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Figure 1. Distribution of eigenvalues for a 4 4 lat- 
tice with Wilson action in the confined phase at 
(3 = 5.0 and k — 0.14. Data provided by A. Borici 
and P. deForcrand 101. 



the Wilson theory they are distributed as shown 
in Fig. [l]. The eigenvalues lying along the real axis 
in the confined phase acquire an imaginary part 
at the chiral/deconfinement transition. I believe 
that these real eigenvalues play a central role in 
the rate of convergence, and a study of their dis- 
tribution and change with preconditioning may 
provide clues to developing better algorithms. 



2.2. Convergence Criteria 

We define the remainder at the n th iteration 
to be r„ = Mxn — 4> and in the figures plot the 
convergence condition C = \r\ 2 /\x\ 2 - On 32 3 x 64 
lattices, 32 bit IEEE precision restricts the pre- 
cision to which C can be measured to w 10~ 14 
due to round-off errors. In the figures we plot the 
iterated residue, so we compare algorithms us- 
ing the same preconditioned matrix. Also, since 
our final stopping criteria is with respect to the 
true residue, a comparison of the total number of 
multiplies by the original Dirac matrix M is also 
meaningful, irrespective of the preconditioning. 



Table 1 

Operations per iteration in the various algorithms 
and number of stored vectors. 
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2.3. Polynomial Preconditioning 

The following pre-conditioned systems are used 
to solve for x H 



(1 - k 2 D 2 )x = M'Mx = (1 - nD)(j> 
(1 - K 4 D 4 ) X = (1 + k 2 D 2 ){1 - nD)(j) 



(4) 



where we define M' = 1 — nD(r). We label these 
second and fourth order pre-conditioned systems 
by adding subscripts to the algorithm name. The 
second-order form is equivalent to red/black pre- 
conditioning and one can halve the number of op- 
erations by solving for only the red (or black) sites 
and reconstructing the other from these. On the 
CM5 the necessary reorganization of data arrays 
to make use of this simplification has not been 
done. The tests have been performed solving for 
the full matrix as they do not affect the number 
of matrix multiplies needed (the red and black 
systems are solved independently and simultane- 
ously) . 

2.4. Staggered Fermions 

The staggered version of the Dirac matrix has 
the form m q + iA, where A is hermitian. Thus, 
the hermitian product M'M = (m 2 + A 2 ) is also 
second order preconditioned. For such systems 
CG has proven to be the optimal algorithm pjlj. 

3. ALGORITHMS 

In this section we give the precise implementa- 
tion of the various algorithms since the speed of 
convergence can depend on the details. We also 
give the performance characteristics of these al- 
gorithms, saving a comparitive study for the next 
section. 
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3.1. Minimal Residue 

The minimal residue algorithm consists of the 
following steps: 

do i = 1, until .converged 

a = (Mn-i,n-i) I (Mn-i, Mrw) 

Xi = Xi-i -u ol r,_i 

n = n-i — us a M r, t -i 
test convergence 
enddo (5) 

where a is a complex number of O(l), and the 
order of the hcrmitian dot product is important. 
(The algorithm fails to converge if we use a* in- 
stead of a, and is about 50% slower if we use 
real (a)). For initial guess one can use \o = or 
any other trial vector. The only restriction seems 
to be that (Mr,_i,r,_i) should not be small. 
This situation arises for \n produced by the CG 
class of algorithms. Thus, MR should NOT be 
used following even a few CG steps. 

The parameter us in Eq. ^ is an over-relaxation 
parameter. We find that, to within few percent, 
the convergence rate is insensitive to u) in the 
range 1.1 — 1.35, and gives the envelope of best 
convergence. For 1.0 < u> < 1.08 the conver- 
gence fluctuates between the ui = 1 (poorest) and 
u) = 1.1 (best) cases as shown in Fig. 2. 

We find that higher orders of polynomial pre- 
conditioning extend the range of applicability (to- 
wards smaller quark masses) of the MR algo- 
rithm. There is a ~ 10% degradation in CPU 
time with each order, MR4 takes roughly 10% 
more computer time but can be used for lighter 
quark masses . 

3.2. Conjugate Gradient 

The CG method for a non-hermitian matrix is 

do i = 1, until .converged 

a= |M+ri-i| 2 /|M ft _i| a 

Xi = Xi-i - agi-i 
n = - aMg^i 

/3 = |Aft ri | 2 /|Mtr i _i| a 
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Figure 2. Behavior of MR4 versus the over- 
relaxation parameter u). The band for 1.1 < u> < 
1.35 shows the insensitivity of the algorithm to 
the precise value of u> in this range. The case 
uj = 1.05 shows a step like behavior bounded by 
the "best" and normal (ui = 1.0) case. The spikes 
at the end are due to refreshing the remainder. 



9i = 09i-i + AfVi 
test convergence 
enddo (6) 

where both a, are complex numbers and we 
set go — M'ro. For M we use the second or- 
der preconditioned (non-hermitian) matrix de- 
fined in|^. The CG algorithm shows two distinct 
rates of convergence, a fast initial drop, and a 
slower asymptotic rate that is relevant for light 
quarks and makes CG less competitive than M R 
or BiCGstab. The only cases where we have 
seen some gain due to the fast initial conver- 
gence behavior of CG is on switching from MR or 
BiCGstab at late stages of convergence, and even 
then only for light quarks. This is exemplified in 
Fig. I ' 
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Figure 3. Comparison of CG2, MR2 and 
BiCGstabi algorithms. The breaks in the conver- 
gence of CG2 are due to refreshing the remainder, 
however, refreshing does not affect the aymptotic 
convergence rate. Data at all values of quark mass 
show that CG2 is not competitive with MR2 or 
BiCGstab^ however, switching to CG in the final 
stages of MR or BiCGstab algorithms enhances 
their convergence and makes them comparable. 



3.3. BiCGjs 

The BiCGjs algorithm exploits the property 
75-^75 = M' of Dirac fermions to yield a stabi- 
lized version of CG algorithm for the matrix M 
itself § 

do i = 1, until _converged 

a = (3i_i,753i-i)/(5i-i)75-Wffi-i) 
Xi = Xi-l ~ "Si-l 
n = n_i - aMgi-i 

= (n,i5n)/(ri-i,j5n-i) 
fji = Pgi-i + n 

test convergence 
enddo (7) 

where a and are real and go = tq. This al- 
gorithm, in our opinion, has two drawbacks as 
shown in Fig. |]; it is sensitive to the initial Xo> 



Figure 4. Comparison of BiCGj§ and 
BiCGstabi algorithms. Both solve for \ with 
respect to the un-preconditioned matrix 1 + kD. 



for example the choice xo = is bad as the al- 
gorithm actually blows up. A second pecularity 
of this algorithm is that the convergence shows 
very large fluctuations with a period of ~ 10 iter- 
ations. Overall, we do not see any advantage to 
using it since the asymptotic convergence rate is 
no better than MR or BiCGstab. 

Forcrand and Borici have raised the possibility 
that the problems could be due to round-off errors 
in the accumulations. For this purpose we ran our 
tests with dot products done in 64 bit precision. 
This change did not improve the convergence. 

3.4. BiCGstab 

The steps in the BiCGstab algorithm are || 

do i = 1, until ..converged 

0= {fo,ri-i)/{fo,ri-2) 

w<-i 

Pi = n-i + p{pi-i - wvi-i) 

Vi = Mpi 

a = (fo,n-i)/(f ,Vi) 
Si — r,_i - avi 
U = Msi 



5 



/c=0.1563 (M^O.185) 



iCGstabn 




1000 2000 
§ of Matrix Multiplies 



Figure 5. Performance details of BiCGstabi al- 
gorithm. The effects of refreshing the remainder 
and switching to CG are highlighted. 



UJi = (ti, Si) I '(ti,ti) 

Xi = Xi-i + + apt 

n = Si - uu 

test convergence 



enddo 



(8) 



where a, /3, uj are complex numbers initialized to 
(1, 0), v = Po = 0, and f = r = cj> - Mxo- We 
have tested this algorithm for two cases of M, the 
original Dirac matrix and the second-order pre- 
conditioned version and label these as BiCGstabi 
and BiCGstabi respectively. 

The sharp spikes at count 600 and 1200 in 
Fig. H result from refreshing the remainder. The 
data show that the convergence rate is not af- 
fected by these spikes. The sharp drop at count 
1900 is caused by a switch to CG and is apparent 
only for the smallest quark mass. 

4. Clover action 

In Fig. ^| we show the convergence behavior of 
Mi?2 and BiCGstabi algorithms when inverting 
the tadpole improved clover action at j3 = 6.0. 
The dependence on gauge configurations is exhib- 
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Figure 6. Comparison of MR2 (solid lines) and 
BiCGstabi (dotted) algorithms. In both cases 
data are shown for six distinct lattices. 



ited by using 6 distinct lattices and the lightest 
quark mass (m^ = 0.185). Our conclusion is that 
the performance of both algorithms is compara- 
ble. If one is concerned about the stability of 
MRi, then one can use MR4. The overhead of 
the clover term is w 15%. 
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